Normalized Compression Distance of Multiples

Authors

  • Andrew R. Cohen
  • Paul M. B. Vitányi
Abstract

Normalized compression distance (NCD) is a parameter-free similarity measure based on compression. The NCD between pairs of objects is not sufficient for all applications. We propose an NCD of finite multisets (multiples) of objects that is metric and is better for many applications. Previous attempts to obtain such an NCD failed. We use the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated from above by the length of the compressed version of the file involved, using a real-world compression program. We applied the new NCD for multiples to retinal progenitor cell questions that were earlier treated with the pairwise NCD, and obtained significantly better results. We also applied the NCD for multiples to synthetic time sequence data. The preliminary results are as good as those of a nearest-neighbor Euclidean classifier.
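
The abstract does not state any formulas, so the following is only a minimal sketch of the idea it describes: compressed length stands in for Kolmogorov complexity. The pairwise function uses the standard NCD definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The multiset function is an assumed generalization (compressed length of the whole multiset, minus the smallest single-element length, divided by the largest length with one element left out); it reduces to the pairwise form for two elements, but it should not be read as the authors' exact definition. The names `clen`, `ncd`, and `ncd1_multiset`, the choice of zlib as the compressor, and concatenation as the way to combine objects are all assumptions made for illustration.

```python
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes, used as a computable stand-in
    (an upper-bound approximation) for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Standard pairwise NCD with concatenation as the pairing."""
    cx, cy = clen(x), clen(y)
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)


def ncd1_multiset(items: list[bytes]) -> float:
    """Assumed multiset generalization (hypothetical helper, not
    taken from the abstract). Items are concatenated in list order,
    so the result can depend on that order with a real compressor."""
    if len(items) < 2:
        raise ValueError("need at least two objects")
    g_all = clen(b"".join(items))
    g_min = min(clen(x) for x in items)
    # Largest compressed length over all "leave one element out" subsets.
    g_rest = max(
        clen(b"".join(items[:i] + items[i + 1:]))
        for i in range(len(items))
    )
    return (g_all - g_min) / g_rest
```

With a real compressor the values are only approximate: the distance of an object to itself is small but usually not exactly zero, and results can slightly exceed 1.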

Similar resources

Information Distance in Multiples

Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal overlap, additivity, and normalized information distance in multiples. We use the th...

Common Pitfalls Using the Normalized Compression Distance: What to Watch out for in a Compressor

Using the mathematical background for algorithmic complexity developed by Kolmogorov in the sixties, Cilibrasi and Vitanyi have designed a similarity distance named normalized compression distance applicable to the clustering of objects of any kind, such as music, texts or gene sequences. The normalized compression distance is a quasi-universal normalized admissible distance under certain condi...

Normalized Information Distance is Not Semicomputable

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called ‘normalized compression distance’ and it is trivially computable. It is a parameter-free similarity measure based on comp...

Nonapproximability of the Normalized Information Distance

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called ‘normalized compression distance’ and it is trivially computable. It is a parameter-free similarity measure based on comp...

Normalized Distance Matrix Method for Construction of Phylogenetic Trees Using New Compressor - Dnabit Compress

We define a compression distance, based on a normal compressor to show it is an admissible distance. The first theme concerns the statistical significance of compressed file sizes. Only in recent years have scientists begun to appreciate the fact that compression ratios signify a great deal of important statistical information. In applying the approach, we have used a new DNA sequence compresso...

Journal title:
  • CoRR

Volume: abs/1212.5711

Publication date: 2012